TTS Eval: Add TTS evaluation (MOS estimation) #2392
base: develop
Conversation
Thank you @flexthink for your contribution! Having a model for MOS estimation is valuable for SpeechBrain. Here are some comments following an initial code inspection:
I recommend that @BenoitWang review this PR as well; his insights would be valuable. Thank you once again for your contribution.
I guess this may not be the final version of the code, since we have another branch where we ran our latest experiments. I agree that we should keep only the best recipe. Some observations from my benchmark:
As for the code, it looks good to me; the only thing I see here is that I don't find
        self.pool = StatisticsPooling(return_std=False)
        self.out_proj = Linear(n_neurons=1, input_size=d_model)

    def forward(self, wav, length):
I think we could simply declare these modules in the YAML file and call them in compute_forward; that would probably be clearer to users.
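A minimal sketch of that layout, for illustration only. It uses plain PyTorch stand-ins rather than the PR's actual classes: `MeanPool` approximates `StatisticsPooling(return_std=False)`, `nn.Linear` stands in for SpeechBrain's `Linear`, and the `nn.ModuleDict` plays the role of the modules a recipe would load from YAML. The standalone `compute_forward` function and the dummy tensors are assumptions, not code from this PR.

```python
import torch
from torch import nn


class MeanPool(nn.Module):
    """Stand-in for StatisticsPooling(return_std=False): length-aware mean over time."""

    def forward(self, x, lengths):
        # x: (batch, time, d_model); lengths: relative lengths in (0, 1], shape (batch,)
        max_len = x.shape[1]
        abs_len = torch.clamp((lengths * max_len).round().long(), min=1)
        mask = torch.arange(max_len, device=x.device)[None, :] < abs_len[:, None]
        mask = mask.unsqueeze(-1).to(x.dtype)  # (batch, time, 1)
        return (x * mask).sum(dim=1) / mask.sum(dim=1)


d_model = 512  # same value as d_model in the config under review

# In a recipe, these would be declared in the YAML file instead of being
# wrapped inside a custom model class (the dict here is only a stand-in).
modules = nn.ModuleDict(
    {
        "pool": MeanPool(),
        "out_proj": nn.Linear(d_model, 1),
    }
)


def compute_forward(modules, feats, lengths):
    """Hypothetical compute_forward body: pool encoder features, project to one score."""
    pooled = modules["pool"](feats, lengths)  # (batch, d_model)
    return modules["out_proj"](pooled).squeeze(-1)  # (batch,)


feats = torch.randn(3, 50, d_model)  # dummy encoder output
lengths = torch.tensor([1.0, 0.8, 0.5])  # relative lengths
scores = compute_forward(modules, feats, lengths)
print(scores.shape)
```

With this arrangement the pooling and projection layers are visible in the configuration, and swapping either one would not require touching the model code.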
        return x


def compute_feats_dim(model):
For the feature dimension, I think we can declare it in the YAML file (for example, base=768, large=1024), as we did for other SSL recipes.
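As a rough sketch of the idea, the declared values could replace a runtime probe like `compute_feats_dim`. The dictionary below stands in for entries that would live in the YAML file; the key names `base` and `large` and the helper `feats_dim_for` are hypothetical, and only the 768/1024 values come from the comment above.

```python
# Feature dimensions quoted in the review comment; in a recipe these would be
# YAML entries rather than a Python dict (illustrative stand-in only).
SSL_FEATS_DIM = {"base": 768, "large": 1024}


def feats_dim_for(variant):
    """Return the declared feature dimension for an SSL model variant (hypothetical helper)."""
    try:
        return SSL_FEATS_DIM[variant]
    except KeyError:
        raise ValueError(f"unknown SSL variant: {variant!r}") from None


print(feats_dim_for("large"))
```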
d_model: 512
d_ffn: 2048
num_layers: 4
We may need to update this to the best config later (for example, WavLM large with a 3-layer encoder), along with the learning rate, dropout, etc.
What does this PR do?
Add TTS evaluation models trained on the SOMOS dataset
There should be no breaking changes
PR review
Reviewer checklist